Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost
Abstract
3D-stacked DRAM alleviates the memory bandwidth bottleneck in modern systems by leveraging through-silicon vias (TSVs) to deliver higher external memory channel bandwidth. Today’s systems, however, cannot fully utilize the higher bandwidth offered by TSVs, due to the limited internal bandwidth within each layer of the 3D-stacked DRAM. We identify that the bottleneck to enabling higher bandwidth in 3D-stacked DRAM is now the global bitline interface, the connection between the DRAM row buffer and the peripheral IO circuits. The global bitline interface consists of a limited and expensive set of wires and structures, called global bitlines and global sense amplifiers, whose high cost makes it difficult to simply scale up the bandwidth of the interface within a single DRAM layer in the 3D stack. We alleviate this bandwidth bottleneck by exploiting the observation that several global bitline interfaces already exist across the multiple DRAM layers in current 3D-stacked designs, but only a fraction of them are enabled at the same time. We propose a new 3D-stacked DRAM architecture, called Simultaneous Multi-Layer Access (SMLA), which increases the internal DRAM bandwidth by accessing multiple DRAM layers concurrently, thus making much greater use of the bandwidth that the TSVs offer. To avoid channel contention, the DRAM layers must coordinate with each other when simultaneously transferring data. We propose two approaches to coordination, both of which deliver four times the bandwidth for a four-layer DRAM, over a baseline that accesses only one layer at a time. Our first approach, Dedicated-IO, statically partitions the TSVs by assigning each layer a dedicated set of TSVs that operate at a higher frequency. Unfortunately, Dedicated-IO requires a non-uniform design for each layer (increasing manufacturing costs), and its DRAM energy consumption scales linearly with the number of layers.
Our second approach, Cascaded-IO, solves both issues by instead time-multiplexing all of the TSVs across layers. Cascaded-IO reduces DRAM energy consumption by lowering the operating frequency of higher layers. Our evaluations show that SMLA provides significant performance improvement and energy reduction across a variety of workloads (on average, a 55% performance improvement and an 18% energy reduction for multi-programmed workloads) over a baseline 3D-stacked DRAM, with low overhead.
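The bandwidth arithmetic behind the two coordination schemes can be sketched with a toy model. This is an illustrative back-of-the-envelope calculation, not the authors' evaluation methodology. The TSV count, the base frequency, the one-bit-per-TSV-per-cycle transfer rate, and the choice to scale the TSV frequency by the number of layers are all assumptions made here so that the model reproduces the abstract's "four times the bandwidth for a four-layer DRAM" claim. The names `Stack`, `baseline_bandwidth`, `dedicated_io_bandwidth`, and `cascaded_io_bandwidth` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Toy description of a 3D-stacked DRAM (all values are assumptions)."""
    layers: int = 4               # four-layer example from the abstract
    tsvs: int = 128               # hypothetical number of data TSVs
    base_freq_mhz: float = 200.0  # hypothetical baseline TSV frequency


def baseline_bandwidth(s: Stack) -> float:
    """One layer at a time drives every TSV at the base frequency.

    Assumes one bit per TSV per cycle, so the result is in Mbit/s.
    """
    return s.tsvs * s.base_freq_mhz


def dedicated_io_bandwidth(s: Stack) -> float:
    """Static partitioning: each layer owns tsvs/layers TSVs.

    Assumption: the dedicated TSVs run at layers * base frequency.
    """
    tsvs_per_layer = s.tsvs // s.layers
    per_layer = tsvs_per_layer * (s.base_freq_mhz * s.layers)
    return per_layer * s.layers  # all layers transfer concurrently


def cascaded_io_bandwidth(s: Stack) -> float:
    """Time multiplexing: all layers share all TSVs in turn.

    Assumption: the shared channel runs at layers * base frequency and
    each layer receives an equal 1/layers share of the time slots.
    """
    channel = s.tsvs * (s.base_freq_mhz * s.layers)
    per_layer = channel / s.layers
    return per_layer * s.layers


if __name__ == "__main__":
    stack = Stack()
    base = baseline_bandwidth(stack)
    print("Dedicated-IO speedup:", dedicated_io_bandwidth(stack) / base)
    print("Cascaded-IO speedup:", cascaded_io_bandwidth(stack) / base)
```

Under these assumptions both schemes reach the same aggregate bandwidth. The model captures only the bandwidth side of the comparison: it does not represent the per-layer design differences or the energy behavior that the abstract cites as the reasons to prefer Cascaded-IO.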
Similar resources
Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface
Limited memory bandwidth is a critical bottleneck in modern systems. 3D-stacked DRAM enables higher bandwidth by leveraging wider Through-Silicon-Via (TSV) channels, but today’s systems cannot fully exploit them due to the limited internal bandwidth of DRAM. DRAM reads a whole row simultaneously from the cell array to a row buffer, but can transfer only a fraction of the data from the row buffe...
PicoServer Revisited: On the Profitability of Eliminating Intermediate Cache Levels
The confluence of 3D stacking, emerging dense memory technologies, and low-voltage throughput-oriented manycore processors has sparked interest in single-chip servers as building blocks for scalable data-centric system design. These chips encapsulate an entire memory hierarchy within a 3D-stacked multi-die package. Stacking alters key assumptions of conventional hierarchy design, drastically in...
Resource Management Design in 3D-Stacked Multicore Systems for Improving Energy Efficiency
Technology scaling and increasing power densities have led to a transition from single-core to multi-core processors, and the trend is now moving towards many-core architectures. Hundreds of millions of transistors can now be integrated on a single chip; however, they cannot be fully exploited due to interconnect/memory latency, power consumption, and yield-related challenges. 3D integration is...
When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads
Response time requirements for big data processing systems are shrinking. To meet this strict response time requirement, many big data systems store all or most of their data in main memory to reduce the access latency. Main memory capacities have grown, and systems with 2 TB of main memory capacity are available today. However, the rate at which processors can access this data—the memory bandwidth...
Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions
Performance improvements from DRAM technology scaling have been lagging behind the improvements from logic technology scaling for many years. As application demand for main memory continues to grow, DRAM-based main memory is increasingly becoming a larger system bottleneck in terms of both performance and energy consumption. A major reason for poor memory performance and energy efficiency is me...
Publication date: 2015